Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Recover interrupted dlc protocol on restart #1817

Merged
merged 8 commits into from
Jan 16, 2024

Conversation

holzeis
Copy link
Contributor

@holzeis holzeis commented Jan 15, 2024

This change introduces a naive mechanism to recover from an interrupted dlc protocol (e.g. lost connection or a restart) by persisting the hash of any received message and resending the last outbound dlc message on any reconnect.

  1. Upon connecting to the peer, the last send dlc message is looked up from the database (last_outbound_dlc_messages) and is naively resend to the counterparty (this is applicable for both sides).
  2. Once the counterparty receives the last message, it will check if it has already processed that message looking up the hash of the dlc message in the dlc_messages table.
  3. If it has already seen the message it will simply ignore it if not, it will continue processing it as it appears it must have missed it the first time, thus continuing an interrupted protocol.
  • All dlc messages are stored into the dlc_messages table once successfully processed
  • The last outbound dlc message is stored to the last_outbound_dlc_messages table. Note, that since a dlc message can get quite large we are only keeping the last message around so to not waste space.

In order to keep track of all outbound dlc messages I had to introduce the SendDlcMessage Event, which re-routes the sending over the app / coordinator layer who is then able to store the message firstly into the dlc_messages and last_outbound_dlc_messages before actually sending the message to the counter party. This way we ensure we are always able to resend the last message.

TODO (Not necessarily everything in this PR)

  • If the protocol got interrupted after processing an Offer, but did not process the accept yet. The protocol will still get stuck. feat: Recover from pending offer #1822
  • sync_position_to_subchannel should be removed, adapted for a dlc channel.
  • Upon restarting the app within an incomplete dlc channel state a catchup dialog should be shown.
  • Testing

Future work / Notes:

  • I had to provide a custom OnionMessageHandler (TenTenOneOnionMessageHandler) to get an reliable hook on when a peer connected. This seems to be far more reliable than to rely on the websocket connection and should be considered for triggering our rollover and expired position actions.
  • Since we are using different data sources with independent transactional systems it can happen that we are persisting things into the sled database (dlc), but not into the sqlite database (10101). This should get addressed eventually. e.g. by also storing the dlc data structures into the same sqlite database.

@holzeis holzeis self-assigned this Jan 15, 2024
@holzeis holzeis marked this pull request as ready for review January 15, 2024 12:47
@holzeis holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from d8694b6 to 5b35de5 Compare January 15, 2024 12:48
@holzeis holzeis changed the title chore: Adapt dlc channels api chore: Recover interrupted dlc protocol on restart Jan 15, 2024
@holzeis holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from 5b35de5 to eca91d1 Compare January 15, 2024 13:14
@holzeis holzeis changed the title chore: Recover interrupted dlc protocol on restart feat: Recover interrupted dlc protocol on restart Jan 15, 2024
@holzeis holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch 2 times, most recently from e2d1549 to ce27de9 Compare January 15, 2024 14:13
Copy link
Contributor

@bonomat bonomat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice - good work. I think some code duplication can be avoided but otherwise this looks good to me.

coordinator/src/admin.rs Show resolved Hide resolved
crates/ln-dlc-node/src/node/connection.rs Show resolved Hide resolved
coordinator/src/bin/coordinator.rs Outdated Show resolved Hide resolved
crates/ln-dlc-node/src/dlc_message.rs Outdated Show resolved Hide resolved
coordinator/src/dlc_handler.rs Outdated Show resolved Hide resolved
coordinator/src/dlc_handler.rs Outdated Show resolved Hide resolved
crates/ln-dlc-node/src/tests/mod.rs Outdated Show resolved Hide resolved
mobile/native/src/db/dlc_messages.rs Outdated Show resolved Hide resolved
mobile/native/src/dlc_handler.rs Outdated Show resolved Hide resolved
Extending the rust-dlc storage api to also be able to get non-signed channels.
Publishes an event if a peer connected successfully using the onion message handler hook.
@holzeis holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from 30da175 to 876bdab Compare January 16, 2024 11:16
@holzeis holzeis requested a review from bonomat January 16, 2024 11:19
@holzeis
Copy link
Contributor Author

holzeis commented Jan 16, 2024

@bonomat I've addressed all your review remarks and pushed them as fixups on top of my changes. Please have another look.

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@bonomat bonomat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@holzeis holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from 876bdab to a724da4 Compare January 16, 2024 13:30
Upon connecting to the peer, the last send dlc message is looked up from the database (`last_outbvound_dlc_messages`) and is naively resend to the counterparty (this is applicable for both sides).

Once the counterparty receives the last message, it will check if it has already processed that message looking up the hash of the dlc message in the `dlc_messages` table.

If it has alrady seen the message it will simply ignore it if not, it will continue processing it as it appears it must have missed it the first time, thus continuing an interrupted protocol.

- All dlc messages are stored into the `dlc_messages` table once successfully processed
- The last outbound dlc message is stored to the `last_outbound_dlc_messages` table. Note, that since a dlc message can get quite large we are only keeping the last message around so to not waste space.

In order to keep track of all outbound dlc messages I had to introduce the `SendDlcMessage` Event, which re-routes the sending over the app / coordinator layer who is then able to store the message firstly into the `dlc_messages` and `last_outbound_dlc_messages` before actually sending the message to the counter party. This way we ensure we are always able to resend the last message.

Note, since we are using different data sources with independet transactional systems it can happen that we are persting things into the sled database (dlc), but not into the sqlite database (10101). This should get addressed eventually. e.g. by also storing the dlc data structures into the same sqlite database.
After e268c62, the order can't be copied anymore.
The node was still using its own implementation for sending dlc message, this was primarily needed because it also had to support `SubChannelMessages`. This change removes all subchannel and onchain protocol handlings from processing incoming dlc messages, as those protocols aren't used.

By switching to the common `NodeEvent::SendDlcMessage` we can rely on the outbound message being persisted during sending and can get rid of the `DlcMessageStep`.
Instead of processing the position state on the outbound dlc message. IMHO the logic is easier to understand that way and the code is more cohesive that way. (at least all the code that needs to be refactored is now in one place :)
@holzeis holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from a724da4 to 14a2169 Compare January 16, 2024 13:32
@holzeis holzeis added this pull request to the merge queue Jan 16, 2024
Merged via the queue into main with commit b4ef2c7 Jan 16, 2024
9 checks passed
@holzeis holzeis deleted the feat/dlc-channel-recovery-on-restart branch January 16, 2024 14:32
@holzeis holzeis linked an issue Jan 16, 2024 that may be closed by this pull request
luckysori added a commit that referenced this pull request Jan 18, 2024
By default this is set to 2MB, according to the docs[1].

We need larger bodies than before (even in the e2e test!), because we
added lots of CET adaptor signatures to the database recently[2].

I also played around with nginx's `client_max_body_size`, but that's
actually irrelevant since we don't use nginx as a reverse proxy for
the coordinator in the e2e tests.

Related to #1719.

[1]: https://docs.rs/axum/latest/axum/extract/struct.DefaultBodyLimit.html#method.disable.
[2]: #1817.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DLC-channel protocol recovery on restart
2 participants